Time/Accuracy Tradeoffs for Learning a ReLU with respect to Gaussian Marginals

Goel, Surbhi, Karmalkar, Sushrut, Klivans, Adam

Neural Information Processing Systems

We consider the problem of computing the best-fitting ReLU with respect to square-loss on a training set when the examples have been drawn according to a spherical Gaussian distribution (the labels can be arbitrary). Let $\opt < 1$ be the population loss of the best-fitting ReLU. We prove:
\begin{itemize}
\item Finding a ReLU with square-loss $\opt + \epsilon$ is as hard as the problem of learning sparse parities with noise, widely thought to be computationally intractable. This is the first hardness result for learning a ReLU with respect to Gaussian marginals, and our results imply, {\em unconditionally}, that gradient descent cannot converge to the global minimum in polynomial time.
\item There is an algorithm that agnostically learns a ReLU to error $\opt^{2/3} + \epsilon$ in time $\mathrm{poly}(d, 1/\epsilon)$. The algorithm uses a novel reduction to noisy halfspace learning with respect to $0/1$ loss.
\end{itemize}
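To make the setting concrete, here is a minimal sketch in Python (illustrative only; the data, step size, and all names are my assumptions, and this is not the paper's algorithm): gradient descent on the empirical square loss of $f_w(x) = \max(0, w \cdot x)$ with spherical Gaussian examples and arbitrary labels. The hardness result above says that, in the worst case over labels, no such local-search loop can reach square-loss $\opt + \epsilon$ in polynomial time.

import numpy as np

rng = np.random.default_rng(0)
d, n = 20, 5000

# Spherical Gaussian examples; labels may be arbitrary (agnostic setting).
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)  # placeholder labels, purely for illustration

def relu(z):
    return np.maximum(0.0, z)

def square_loss(w):
    # Empirical square loss of f_w(x) = max(0, w . x).
    return np.mean((y - relu(X @ w)) ** 2)

def grad(w):
    # (Sub)gradient of the empirical square loss in w.
    z = X @ w
    active = (z > 0).astype(float)
    return -2.0 * np.mean(((y - relu(z)) * active)[:, None] * X, axis=0)

w = rng.standard_normal(d) / np.sqrt(d)
for _ in range(1000):
    w -= 0.1 * grad(w)

print(square_loss(w))  # a stationary point; not guaranteed to be near opt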


Reviews: Time/Accuracy Tradeoffs for Learning a ReLU with respect to Gaussian Marginals

Neural Information Processing Systems

This paper studies the computational complexity of learning a single ReLU with respect to Gaussian examples. Since ReLUs are now the standard choice of nonlinearity in deep neural networks, the computational complexity of learning them is clearly of interest. Of course, the computational complexity of learning a ReLU may depend substantially on the specific setting assumed; it is interesting to understand the range of such assumptions and their implications for complexity. This paper studies the following setting: given independent samples $(x_1, y_1), \ldots, (x_n, y_n)$ where $x$ is spherical Gaussian in $d$ dimensions and $y \in \mathbb{R}$ is arbitrary, find a ReLU function $f_w(x) = \max(0, w \cdot x)$ for some vector $w$ with minimal mean squared error $\sum_i (y_i - f_w(x_i))^2$. (This is agnostic learning, since the $y$'s are arbitrary.) The main results are as follows:
1) There is no algorithm that learns a single ReLU with respect to Gaussian examples to additive error $\epsilon$ in time $d^{o(\log 1/\epsilon)}$ unless $k$-sparse parities with noise can be learned in time $d^{o(k)}$.
2) If $\opt = \min_w$ (mean squared error of $f_w$), then (with normalization such that $\opt \in [0,1]$) there is an algorithm that agnostically learns a ReLU to error $\opt^{2/3} + \epsilon$ in time $\mathrm{poly}(d, 1/\epsilon)$.
The proof of (1) goes via Hermite analysis (i.e.
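For readers unfamiliar with the hardness assumption in (1), the following sketch generates a $k$-sparse-parity-with-noise instance (this is the standard definition of the problem, not code from the paper; all names are mine). The conjecture referenced above is that no algorithm can recover the hidden set $S$ from such samples in time $d^{o(k)}$.

import numpy as np

rng = np.random.default_rng(1)
d, k, n, eta = 30, 3, 10000, 0.1

# Hidden size-k subset of coordinates defining the parity.
S = rng.choice(d, size=k, replace=False)

# Uniform examples over {-1, +1}^d.
X = rng.choice([-1, 1], size=(n, d))

# Label = product of the coordinates in S, flipped with probability eta.
clean = np.prod(X[:, S], axis=1)
flips = rng.random(n) < eta
y = clean * np.where(flips, -1, 1)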

